{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Assignment\n",
    "\n",
    "## 问题 1\n",
    "\n",
    "- 综合分词工具和正则表达式提取邮件签名档\n",
    "\n",
    "- 下面有几个来自真实邮件的签名档，请尽可能提取下面的关键字段\n",
    "    - 姓名\n",
    "    - 单位\n",
    "    - 电话号码\n",
    "    - 电子邮件\n",
    "- 数据如下\n",
    "\n",
    "> 刘三 Liu, San  \n",
    "+86 15912348765  \n",
    "sfghsdfg@abc.org.cn    \n",
    "\\--------------------------   \n",
    "> 李四  \n",
    "北清大数据产业联合会   \n",
    "电话：010-34355675  \n",
    "邮箱：lisi@beiqingdata.com  \n",
    "地址：北京市海淀区北清大学东楼201室    \n",
    "\\--------------------------  \n",
    "> John Smith  \n",
    "Data and Web Science Group  \n",
    "University of Mannheim, Germany    \n",
    "http://dws.informatik.uni-mannheim.de/~johnsmith  \n",
    "Tel: +49 621 123 4567  \n",
    "\\--------------------------  \n",
    "> 王五  \n",
    "CSDN-全球最大中文IT技术社区（www.csdn.net）  \n",
    "电话:010-51661202-257  \n",
    "手机:13934567890  \n",
    "E-mail:gdagsdfs@csdn.net  \n",
    "QQ、微信：34534563  \n",
    "地址：北京市朝阳区广顺北大街33号院一号楼福码大厦B座12层  \n",
    "\\--------------------------  \n",
    "> 张三  \n",
    "北京市张三律师事务所|Beijing Zhangsan Law Firm  \n",
    "北京市海淀区中关村有条街1号，邮编：100080  \n",
    "No. 1 Youtiao Street , ZhongGuanCun West, Haidian District, Beijing 100080  \n",
    "Mobile: 15023345465|Email: dfgasedt@126.com  \n",
    "\n",
    "\n",
    "\n",
    "## 思路\n",
    "- 区分汉语, 英语\n",
    "- 用分词工具提取姓名\n",
    "- 用正则提取电话, 邮箱等通讯方式\n",
    "- 难点: 单位如何正确提取?\n",
    "    - 中文: 使用jieba, 添加自定义字典\n",
    "    - 英文: 使用NLTK\n",
    "\n",
    "## 问题\n",
    "- 因为数据量比较小, 可以针对特殊形式加条件, 那么如何才能找出更一般的提取办法, 使得能够应对更大的数据量."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# -*- coding: utf-8 -*-\n",
    "import os\n",
    "import re\n",
    "import jieba\n",
    "import jieba.posseg as pseg\n",
    "import pynlpir\n",
    "from nltk.tag import StanfordNERTagger"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Preparation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Add models of NLTK \n",
    "os.environ[\"CLASSPATH\"] = \"/Users/xpgeng/Library/stanford-ner-2015-12-09\"  # Here I use the absolute directory\n",
    "os.environ[\"STANFORD_MODELS\"] = \"/Users/xpgeng/Library/stanford-ner-2015-12-09/models\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Building prefix dict from the default dictionary ...\n",
      "Loading model from cache /var/folders/8n/zjbv19sd0xg400w9dpmzvsd40000gn/T/jieba.cache\n",
      "Loading model cost 0.439 seconds.\n",
      "Prefix dict has been built succesfully.\n"
     ]
    }
   ],
   "source": [
    "# Load user dict\n",
    "jieba.load_userdict('user_dict.txt')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Read data\n",
    "with open('data.txt', 'r') as f:\n",
    "    data = f.read()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 按\"--------\"将signature分开"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Divide data by \"-----\"\n",
    "p = re.compile(r'-{2,}')\n",
    "signature_list = p.split(data)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Add Tagger\n",
    "st = StanfordNERTagger('english.all.3class.distsim.crf.ser.gz')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 因为中英签名混杂在一起, 光用jieba不能识别英文的组织, 故添加如下函数, 分辨中英文signature"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "def judge_lang(words):\n",
    "    '''\n",
    "    Parameters: Someone's signature (str)\n",
    "    Return: Language\n",
    "    '''\n",
    "    word_list = filter(None, re.split(r',|\\s+', words))\n",
    "    for word in word_list:\n",
    "        if not word.isalpha():\n",
    "            return \"Chinese\"\n",
    "        else:\n",
    "            return \"English\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 对每一条signature提取相关消息, 由于结构相对比较简单, 并未将每个if...改写成function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "def extract_information(signature, language, st):\n",
    "    '''\n",
    "    Parameters: \n",
    "        signature: str\n",
    "        language: \"Chinese\" or \"English\"\n",
    "        st: Stanford Tagger\n",
    "    Return:\n",
    "        information_dict: dict\n",
    "    '''\n",
    "    words_list = re.split(r'\\n|\\|', signature)    # split signature by \\n, |\n",
    "    organization_list = []\n",
    "    tel_list = []\n",
    "    email_list = []\n",
    "    name_list = []\n",
    "    \n",
    "    p_email = re.compile(r'\\w+@(\\w+\\.)+(\\w+)')\n",
    "    p_tel = re.compile(r'''(\\+([\\d|\\s]+)) # +86 132 2345 2345\n",
    "                   | (电话.+([\\d|\\s|\\-]+))\n",
    "                   | (Tel.+([\\d|\\s|\\-]+))\n",
    "                   | (手机.+([\\d|\\s|\\-]+))\n",
    "                   | Mobile.+([\\d|\\s|\\-]+)''', re.VERBOSE)\n",
    "    \n",
    "    for item in words_list:\n",
    "        \n",
    "        # extract tel list\n",
    "        if p_tel.search(item): \n",
    "            tel_group = p_tel.search(item).group()\n",
    "            tel_list.append(tel_group) \n",
    "        \n",
    "        # extract email list\n",
    "        elif p_email.search(item):\n",
    "            m = p_email.search(item).group()\n",
    "            email_list.append(m) \n",
    "            \n",
    "        # extract name and orgnization from Chinese signature\n",
    "        elif language == \"Chinese\":\n",
    "            words = pseg.cut(item)\n",
    "            flag_list = [flag for word, flag in words]\n",
    "            if 'nt' in flag_list:\n",
    "                organization_list.append(item)\n",
    "            elif 'nr'in flag_list:\n",
    "                name_list.append(item)\n",
    "        \n",
    "        # Extract name and orgnization from English signature\n",
    "        elif language == \"English\":\n",
    "            flag_list = [flag for word, flag in st.tag(item.split())]\n",
    "            if 'ORGANIZATION' in flag_list:\n",
    "                organization_list.append(item)\n",
    "            elif 'PERSON' in flag_list:\n",
    "                name_list.append(item)\n",
    "        else:\n",
    "            return \"Nothing to extract!\"\n",
    "    \n",
    "    information_dict = {\"name\": list(set(name_list)), \"tel\": tel_list, \n",
    "                        \"email\": email_list, \"organization\": organization_list}\n",
    "    return information_dict         "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 将结果写入txt中, 这里花费了些时间在调整格式...."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "姓名: 刘三 Liu, San\n",
      "单位:\n",
      "+86 15912348765\t\n",
      "Email: ['sfghsdfg@abc.org.cn']\n",
      "-------\n",
      "\n",
      "姓名: 李四\n",
      "单位:北清大数据产业联合会 \n",
      "电话：010-34355675\t\n",
      "Email: ['lisi@beiqingdata.com']\n",
      "-------\n",
      "\n",
      "姓名: John Smith\n",
      "单位:Data and Web Science Group University of Mannheim, Germany \n",
      "Tel: +49 621 123 4567\t\n",
      "Email: []\n",
      "-------\n",
      "\n",
      "姓名: 王五\n",
      "单位:CSDN-全球最大中文IT技术社区（www.csdn.net） \n",
      "电话:010-51661202-257\t手机:13934567890\t\n",
      "Email: ['gdagsdfs@csdn.net']\n",
      "-------\n",
      "\n",
      "姓名: 张三\n",
      "单位:北京市张三律师事务所 \n",
      "Mobile: 15023345465\t\n",
      "Email: ['dfgasedt@126.com']\n",
      "-------\n",
      "\n"
     ]
    }
   ],
   "source": [
    "# Write into result.txt\n",
    "with open('result.txt', 'a') as f:\n",
    "    for signature in signature_list:\n",
    "        info_dict = extract_information(signature, judge_lang(signature), st)\n",
    "        name = '姓名: {name}'.format(name=info_dict['name'][0])\n",
    "        organization = '单位:'\n",
    "        for item in info_dict['orgnization']:\n",
    "            organization += \"%s \" % item\n",
    "        tel = ''\n",
    "        for item in info_dict['tel']:\n",
    "            tel += \"%s\\t\" %item\n",
    "        email = \"Email: %s\" % info_dict['email']\n",
    "        info = '%s\\n%s\\n%s\\n%s\\n-------\\n'% (name, organization, tel, email)\n",
    "        f.write(info)\n",
    "        print info\n",
    "f.close()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "## 总结\n",
    "- 第一次做信息提取, 感觉始终是在做试验, 试验正则表达式是否正确提取自己需要的信息等等\n",
    "- 反而真正写代码并没有花费多少时间\n",
    "- 在结果输出上又花费了很久才调出自己满意的格式, 意识到信息提取了, 表达原来也要花时间...\n",
    "- 因为加入了词性标注, 运行速度比较慢, 代码还有改进的空间. \n",
    "- Anyway, 先拿出MVP再说!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
